Abstract

This paper uses a database of collaboration records among econophysics scientists
to study the community structure of the resulting collaboration network, which has
a single type of vertex and a single type of undirected, weighted edge. Hierarchical
clustering and the algorithm of Girvan and Newman are used to analyze the data.
We emphasize the influence of the edge weights on the detected communities by
comparing the results obtained under different weight assignments. A function D is
proposed to quantify the difference between these results. Finally, we explain the
results and discuss the community structure.
1 Introduction

In recent years, as more and more systems in many different fields have come to be
described as complex networks, the study of complex networks has become an
important topic. Examples include the World Wide Web, social networks, biological
networks, food webs, and biochemical networks. Community structure, one of the
important properties of networks, has attracted much attention. A community is a
group of network vertices with dense internal links among its nodes but only loose
connections to the rest of the network [8]. Communities are useful, even critical, for
understanding the functional properties of complex structures. Detecting and
analyzing the underlying communities is therefore an important task.

The idea of finding communities is closely related to graph partitioning in
graph theory, computer science, and sociology. The study of community structure
in networks has a long history, and several types of algorithms have been developed
for finding it. Early algorithms such as spectral bisection and the Kernighan-Lin
algorithm perform poorly in many general cases. To overcome these problems, many
new algorithms have been proposed in recent years [12, 13]. Among them, the
algorithm of Girvan and Newman (GN) is one of the most successful. It is a divisive
algorithm based on edge betweenness, a generalization of the betweenness first
introduced by Freeman [14]. The betweenness of an edge in a network is defined as
the number of shortest paths passing through it. Edges that connect communities
have larger betweenness values, because all shortest paths between nodes in
different communities must run along them. By removing the edge with the largest
betweenness at each step, we gradually split the whole network into isolated
components, or communities.

The GN algorithm has been applied successfully to different kinds of networks.
For example, it has been used to study the community structure of the collaboration
network of jazz musicians. The analysis revealed communities that correlate strongly
with the recording locations of the bands, and it also showed racial segregation
among the musicians.

In recent years, most of the real-world systems that have been studied were
represented as non-weighted networks, which neglects much of the available data.
Researchers paid more attention to communities as determined by network topology.
However, edge weights are important and may affect the detected communities, since
a weight carries more information than the mere presence or absence of an edge. For
example, in a social network the connections between individuals can be stronger or
weaker, and edge weights describe these different strengths. When we detect
communities in such a network, we should therefore take the weights into account.
Doing so may give results that agree more closely with the facts than ignoring them.

In [21], we built an Econophysics Scientific Collaboration Network and gave some
statistical results about it. In this paper, we focus on the community structure of
this network. We obtain communities using the GN algorithm and hierarchical
clustering, under several conditions: weighted, non-weighted, and with different
weight assignments. Recently, Newman pointed out that applying the original GN
algorithm to weighted networks gives poor results, and he generalized the GN
algorithm to weighted networks [22]. Newman and Girvan defined a function Q to
measure which division of a given network is best, and a generalization of Q to
weighted networks was proposed in [22]. We applied it to our network and found
that the peaks of Q correspond closely to the expected divisions.
The outline of this article is as follows. In Section 2 we describe our work in
detail. First, we briefly introduce our database and the Econophysics Scientific
Collaboration Network built on it. Then we describe the algorithms and the
definitions of the different weights used to find the underlying communities in
our network. Finally, we give the results under each condition and compare them.
In Section 4 we give our conclusions.

2 The communities acquired by different algorithms 

A network is composed of a set of vertices and a set of edges, each edge representing
a relationship between two nodes. In the Econophysicist collaboration network [21],
each node represents one scientist. Two scientists are connected by an edge if they
have collaborated on one or more papers. To distinguish different levels of
collaboration, we define weights on the edges, so the network is weighted. Here we
take the largest connected cluster of the network as the subject of our research. It
is a sparse network with 271 nodes and 371 edges.

Weight is the crucial factor in our network analysis. Edge weights represent
the strength or capacity of the edges. The weight in this network is defined as
$w_{ij} = \tanh(t_{ij})$, where $t_{ij}$ is the number of papers on which researchers
i and j have collaborated. We prefer the tanh function in empirical studies for two
reasons. First, it has a saturation effect, so each additional collaboration
contributes less as the number of collaborations grows. Second, it normalizes the
maximum value to 1, which is the usual edge strength in non-weighted networks [21].
Because this weight is a similarity, a larger weight means a closer relation between
the two end nodes. The weights and connections together provide a natural
description of the distance between two nodes.
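
As a rough numerical illustration of the saturation effect, the following sketch
tabulates the weight for a few collaboration counts. Only the formula
$w_{ij} = \tanh(t_{ij})$ comes from the text; the function name `similarity_weight`
and the sample counts are our own, chosen for illustration.

```python
import math

def similarity_weight(t_ij):
    """Similarity weight w_ij = tanh(t_ij) for t_ij co-authored papers."""
    return math.tanh(t_ij)

if __name__ == "__main__":
    for t in (1, 2, 3, 5, 10):
        # Each extra paper adds less weight; the value approaches 1 from below.
        print(t, round(similarity_weight(t), 4))
```

The step from one to two papers changes the weight far more than the step from two
to three, which is the saturation described above.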

In this section, we apply two methods, hierarchical clustering and the GN
algorithm, to analyze the community structure of our network. We choose them
because the GN algorithm performs well on many networks and hierarchical clustering
is currently the principal technique used for social networks.

In practice, the algorithms will normally be used on networks whose communities
are not known ahead of time. This raises a new problem: how do we know when the
communities found by the algorithm are good ones? To answer this question, Newman
and Girvan proposed a measure of the quality of a particular division of a network,
which they call the modularity. It is defined by

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) . \qquad (1)$$

This quantity measures the fraction of the edges in the network that connect nodes
of the same community minus the expected value of the same quantity in a network
with the same community division but random connections between the vertices. Here
$A_{ij}$ is the element of the adjacency matrix, $k_i = \sum_j A_{ij}$ is the degree
of vertex i, $m = \frac{1}{2} \sum_{ij} A_{ij}$, $c_i$ is the community to which
vertex i is assigned, and $\delta(c_i, c_j)$ is 1 if $c_i = c_j$ and 0 otherwise.
Newman has generalized this measure to weighted networks [22]. We use the analogous
formula for our weighted network, whose similarity weights range from 0 to 1:

$$Q^w = \frac{1}{2m} \sum_{ij} \left[ w_{ij} - \frac{w_i w_j}{2m} \right] \delta(c_i, c_j) , \qquad (2)$$

where $w_{ij}$ is the weight of the edge between nodes i and j, $w_i = \sum_j w_{ij}$
is the weight of node i, and $m = \frac{1}{2} \sum_{ij} w_{ij}$.
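
The weighted modularity can be evaluated directly from an edge list. The sketch
below is a minimal illustration, not the code used for our results: the function
name `weighted_modularity` and the input layout (a dictionary of undirected edges,
each listed once, plus a node-to-community dictionary) are assumptions made for the
example. It uses the fact that the sum over pairs in the same community of
$w_i w_j$ equals the sum over communities of the squared total node weight.

```python
from collections import defaultdict

def weighted_modularity(edges, community):
    """Weighted modularity of a division.

    edges: {(i, j): w_ij}, each undirected edge listed once, no self-loops.
    community: {node: community label}.
    """
    strength = defaultdict(float)   # w_i = sum_j w_ij
    m = 0.0                         # m = (1/2) sum_ij w_ij = total edge weight
    internal = 0.0                  # sum_ij w_ij * delta(c_i, c_j)
    for (i, j), w in edges.items():
        strength[i] += w
        strength[j] += w
        m += w
        if community[i] == community[j]:
            internal += 2.0 * w     # the ordered pairs (i, j) and (j, i)
    # sum_ij w_i w_j delta(c_i, c_j) = sum over communities of (total strength)^2
    total = defaultdict(float)
    for node, s in strength.items():
        total[community[node]] += s
    expected = sum(s * s for s in total.values()) / (2.0 * m)
    return (internal - expected) / (2.0 * m)
```

Setting every weight to 1 recovers the non-weighted modularity of equation (1).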

To find communities by hierarchical clustering, we start from an empty graph
containing all nodes and no edges, and then connect pairs of nodes in order of
decreasing similarity. In our network, we measure similarity by $d_{ij}$, the length
of the shortest path between vertices i and j; a shorter path means a greater
similarity. Once nodes have been clustered into communities, we define the distance
between two communities p and q as

$$D_{pq} = \max_{i \in p,\, j \in q} d_{ij} . \qquad (3)$$
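
One way to read this procedure is as agglomerative clustering in which the two
closest communities, in the sense of the maximum distance above, are merged at each
step. The sketch below follows that reading. It is an assumption about the details,
not a record of our implementation: the names `cluster_distance` and `agglomerate`
are hypothetical, and the shortest-path distances are taken as a precomputed nested
dictionary `d[i][j]`.

```python
from itertools import combinations

def cluster_distance(d, p, q):
    """Distance between communities p and q: the largest d[i][j], i in p, j in q."""
    return max(d[i][j] for i in p for j in q)

def agglomerate(d, k):
    """Merge the two closest communities repeatedly until k communities remain."""
    clusters = [frozenset([v]) for v in d]   # start with every node on its own
    while len(clusters) > k:
        # Pick the pair of communities with the smallest maximum distance.
        p, q = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(d, pair[0], pair[1]))
        clusters = [c for c in clusters if c not in (p, q)] + [p | q]
    return clusters
```

Recording Q after each merge and keeping the division with the largest value gives
the best division in the sense used in this paper.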

The hierarchical clustering method above reveals modules in the network, and its
Q function shows a peak. The best division has 23 clusters.

The GN method gives better results for community analysis. As mentioned in the
Introduction, the idea behind the GN algorithm is edge betweenness, where the
betweenness of an edge is the number of shortest paths passing through it. To find
the shortest paths between all pairs of vertices, we use the Dijkstra algorithm.
For this purpose, the similarity weight $w_{ij} \in [0, 1]$ is transformed into a
dissimilarity weight $\tilde{w}_{ij} = 1/w_{ij}$, so that $\tilde{w}_{ij} \in [1, \infty]$
corresponds to the "distance" between nodes. From now on, all paths are calculated
under this dissimilarity weight unless stated otherwise. The main steps of the GN
algorithm are as follows [3]:

1. Calculate betweenness scores for all edges in the network. 

2. Find the edge with the highest score and remove it from the network. 

3. Recalculate betweenness for all remaining edges. 

4. Repeat from step 2 until all links are removed. 
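
The four steps can be sketched in plain Python as follows. This is an illustrative
sketch under stated assumptions rather than the program used for our results: the
graph is assumed to be a symmetric nested dictionary `adj[u][v]` of positive
distances, node labels are assumed to be mutually comparable (for example integers),
and the function names are hypothetical. Betweenness is accumulated in the standard
way, so that when several shortest paths join a pair of nodes each path carries an
equal fraction.

```python
import heapq
from collections import defaultdict

def edge_betweenness(adj):
    """Weighted edge betweenness using Dijkstra searches from every node.

    adj: {u: {v: distance}} for an undirected graph, all distances > 0.
    Returns {frozenset({u, v}): betweenness of the edge}.
    """
    score = defaultdict(float)
    for s in adj:
        dist = {s: 0.0}
        sigma = defaultdict(float)      # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)       # predecessors on shortest paths
        order, done = [], set()
        heap = [(0.0, s)]
        while heap:
            d_u, u = heapq.heappop(heap)
            if u in done:
                continue                # stale heap entry
            done.add(u)
            order.append(u)
            for v, w in adj[u].items():
                d_v = d_u + w
                if v not in dist or d_v < dist[v] - 1e-12:
                    dist[v], sigma[v], preds[v] = d_v, sigma[u], [u]
                    heapq.heappush(heap, (d_v, v))
                elif v not in done and abs(d_v - dist[v]) <= 1e-12:
                    sigma[v] += sigma[u]    # another equally short path
                    preds[v].append(u)
        delta = defaultdict(float)
        for v in reversed(order):           # farthest nodes first
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                score[frozenset((u, v))] += share
                delta[u] += share
    # Every pair of end points was counted once from each side.
    return {e: b / 2.0 for e, b in score.items()}

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(v for v in adj[u] if v not in comp)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman(adj):
    """Yield the division of the network after each edge removal."""
    adj = {u: dict(nbrs) for u, nbrs in adj.items()}   # work on a copy
    yield components(adj)
    while any(adj.values()):
        score = edge_betweenness(adj)       # steps 1 and 3: (re)calculate
        u, v = max(score, key=score.get)    # step 2: highest betweenness
        del adj[u][v], adj[v][u]
        yield components(adj)               # step 4: repeat until no edges remain
```

For a weighted network the distances would be the dissimilarity weights described
above, and the division to keep is the one with the largest Q among those yielded.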

The best result, given by the maximum of Q, has 10 clusters. The GN algorithm and
hierarchical clustering, both evaluated with equation (2), each reveal modules in
the network. We analyzed the communities of the best divisions against the data,
and the results of the GN algorithm are better than those of hierarchical
clustering: in the best GN division, scientists who are in the same university or
institute, or who are interested in similar research topics, are clustered into one
community, which is close to reality. For example, in Figure 1 most members of the
red community are from Boston University, USA. Other communities, such as the
yellow one, consist of members who focus on the same topic. By contrast, although
hierarchical clustering also reveals modules, its result is not consistent with
reality.

Figure 1: The best division of the Econophysics Scientist Collaboration Network, with
the communities detected by the GN algorithm represented by different colors.

3 The comparison of different formations of communities

In the section above, we found that the results of the GN algorithm and
hierarchical clustering differ. How can the difference between them be quantified?
We define a function D to measure it. The idea is to start from the similarity and
dissimilarity of two sets A and B, both subsets of a universal set $\Omega$. The
similarity is represented by $A \cap B$, and the dissimilarity corresponds to
$(A \cap \bar{B}) \cup (\bar{A} \cap B)$. The normalized similarity and
dissimilarity can therefore be defined as

$$s = \frac{|A \cap B|}{|A \cup B|} , \qquad d = \frac{|(A \cap \bar{B}) \cup (\bar{A} \cap B)|}{|A \cup B|} . \qquad (4)$$
Figure 2: Comparison of the modularity obtained with the GN algorithm and with
hierarchical clustering using the maximum-minimum method. The peaks are at 10 and
23 clusters, respectively.

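In a form that generalizes more conveniently to classification systems with more
than two sets, we can rewrite the expressions above using the characteristic
mapping of a set, $\mathrm{sign}(X, \omega)$, defined as

$$\mathrm{sign}(X, \omega) = \begin{cases} 1 & \text{if } \omega \in X , \\ 0 & \text{if } \omega \notin X . \end{cases} \qquad (5)$$

This mapping to {1, 0} can be calculated mechanically for any element $\omega$ in
$\Omega$ and any subset X. It is easy to check that

$$|A \cap B| = \sum_{\omega \in \Omega} \mathrm{sign}(A, \omega)\, \mathrm{sign}(B, \omega) ,$$

$$|(A \cap \bar{B}) \cup (\bar{A} \cap B)| = \sum_{\omega \in \Omega} |\mathrm{sign}(A, \omega) - \mathrm{sign}(B, \omega)| , \qquad (6)$$

and also

$$|A \cup B| = |A \cap B| + |(A \cap \bar{B}) \cup (\bar{A} \cap B)| . \qquad (7)$$

Therefore, using the characteristic mapping, the similarity and dissimilarity are
re-expressed as

$$s = \frac{\sum_{\omega \in \Omega} \mathrm{sign}(A, \omega)\, \mathrm{sign}(B, \omega)}{\sum_{\omega \in \Omega} \left[ \mathrm{sign}(A, \omega)\, \mathrm{sign}(B, \omega) + |\mathrm{sign}(A, \omega) - \mathrm{sign}(B, \omega)| \right]} ,$$

$$d = \frac{\sum_{\omega \in \Omega} |\mathrm{sign}(A, \omega) - \mathrm{sign}(B, \omega)|}{\sum_{\omega \in \Omega} \left[ \mathrm{sign}(A, \omega)\, \mathrm{sign}(B, \omega) + |\mathrm{sign}(A, \omega) - \mathrm{sign}(B, \omega)| \right]} . \qquad (8)$$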
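
The characteristic-mapping form translates almost literally into code. The
following sketch is only an illustration; the function name
`similarity_and_dissimilarity` is hypothetical, and the value returned for two
empty sets is a convention we assume here, since the ratio is undefined in that
case.

```python
def sign(X, omega):
    """Characteristic mapping: 1 if omega is in X, otherwise 0."""
    return 1 if omega in X else 0

def similarity_and_dissimilarity(A, B, Omega):
    """Return (s, d) by summing the characteristic mappings over Omega."""
    common = sum(sign(A, w) * sign(B, w) for w in Omega)
    differ = sum(abs(sign(A, w) - sign(B, w)) for w in Omega)
    total = common + differ     # size of the union of A and B
    if total == 0:
        return 1.0, 0.0         # assumed convention: two empty sets are identical
    return common / total, differ / total
```

For example, with A = {1, 2, 3}, B = {2, 3, 4} and $\Omega$ = {1, ..., 6}, two
elements are shared and two belong to only one of the sets, so s = d = 0.5.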

Consider a particular division of a network into k communities. Given two
divisions into k communities produced by different algorithms, we can reduce their
comparison to many pairwise comparisons between sets. The procedure is:

1. Construct the correspondence between the subsets of the two divisions.

2. Compare every corresponding pair.

3. Finally, integrate the results of all the pairwise comparisons.

Here, a correspondence relation means that for every subset $X_i$ in classification
X we find its counterpart $Y_i$ in classification Y by means of the similarity
measure; $X_i$ and $Y_i$ play the roles of A and B above. This yields two ordered
sets X and Y whose elements at corresponding positions form a pair of counterparts.
We then apply the dissimilarity measure to every pair to obtain a measure of the
total dissimilarity.

Under this definition, d is normalized to [0, 1], where 0 means no difference and
1 means a large difference.

The principle of the first step is to compare every set $X_i$ from X with all
the $Y_j$ in Y and to pair it with the one with the largest similarity. In some
cases, however, this leads to an awkward correspondence, for instance when many
$X_i$ correspond to the same $Y_j$. In that case, we keep the pair with the largest
similarity, and the other $X_i$ look for a counterpart again among the remaining
$Y_j$. Sometimes we want to compare divisions with different numbers of
communities, for example the best divisions of the GN algorithm and of hierarchical
clustering. As found above, their numbers of communities differ, which means that
some $Y_j$ in the hierarchical clustering have no counterpart. In this case, the
first step can still be carried out by treating the whole group as one large subset
and a missing counterpart as the empty set $\emptyset$; k is then the larger of the
two numbers. We give two examples. In the first, a network of N nodes is divided
into two communities by two algorithms: one division consists of two equal
communities, and the other of a single node and all the remaining nodes. The
dissimilarity between the two divisions is $d \approx 0.75$. In the second, a
network of N nodes is divided into N communities; comparing the whole network with
these N communities gives $d \approx 1$.
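
The three-step procedure can be sketched as below. Two points in the sketch are
assumptions rather than statements of our exact method: the pairing is done
greedily, always accepting the most similar pair of still unmatched sets, and the
pairwise dissimilarities are combined by a plain average over the k pairs. The
average is chosen because it reproduces the two worked examples above. All function
names are hypothetical.

```python
def set_similarity(A, B):
    """Normalized similarity |A & B| / |A | B| (1.0 for two empty sets)."""
    union = A | B
    return len(A & B) / len(union) if union else 1.0

def set_dissimilarity(A, B):
    """Normalized dissimilarity: size of the symmetric difference over the union."""
    union = A | B
    return len(A ^ B) / len(union) if union else 0.0

def total_dissimilarity(X, Y):
    """Average dissimilarity between two divisions X and Y (lists of sets)."""
    X = [set(x) for x in X]
    Y = [set(y) for y in Y]
    k = max(len(X), len(Y))
    X += [set()] * (k - len(X))     # a missing counterpart is the empty set
    Y += [set()] * (k - len(Y))
    # Step 1: greedy correspondence, most similar pairs first.
    ranked = sorted(((set_similarity(x, y), i, j)
                     for i, x in enumerate(X) for j, y in enumerate(Y)),
                    reverse=True)
    used_x, used_y, total = set(), set(), 0.0
    for _, i, j in ranked:
        if i not in used_x and j not in used_y:
            used_x.add(i)
            used_y.add(j)
            total += set_dissimilarity(X[i], Y[j])   # step 2: compare the pair
    return total / k                                 # step 3: integrate
```

In the first example, the larger of the two unequal communities pairs with one half
of the network (dissimilarity close to 0.5) and the single node pairs with the
other half (dissimilarity close to 1), which averages to about 0.75.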

We use this procedure to analyze the dissimilarity between hierarchical clustering
and the GN algorithm; the result is shown in Figure 3. For equal numbers of
communities, the Q curves and the dissimilarity D of the results from the two
algorithms are shown. For the best divisions of the two algorithms,
$D_{best} = 0.756$, which means that they are quite different.

3.1 The influence of weight on the results of communities

We now turn to the effects of weight on the community structure of weighted
networks. In [21], to study the impact of weight on the topological properties of a
network, we introduced a way to re-assign weights to edges, with p = 1, -1, for
weighted networks. Setting p = 1 represents the original weighted network, given by
the ordered series of weights that relates weights to edges in decreasing order,

$$W(p = 1) = \left( w_{i_1 j_1} = w^1 > w_{i_2 j_2} = w^2 > \cdots > w_{i_L j_L} = w^L \right) . \qquad (10)$$

p = -1 is defined as the inverse order,

$$W(p = -1) = \left( w_{i_1 j_1} = w^L < \cdots < w_{i_{L-1} j_{L-1}} = w^2 < w_{i_L j_L} = w^1 \right) . \qquad (11)$$

In this paper, we compare the communities formed in the non-weighted network with
those formed after re-assigning weights to the edges with p = 1, -1, to show the
influence of weight on the results.

Figure 3: The dissimilarity D between the results of hierarchical clustering and the
GN algorithm with the same number of clusters.

Figure 4: The Q of different weights in the GN algorithm.
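
The re-assignment for p = 1 and p = -1 amounts to keeping the set of weight values
fixed while changing which edge carries which value. A minimal sketch, with a
hypothetical function name `reassign_weights` and an assumed dictionary layout for
the edge weights:

```python
def reassign_weights(weights, p):
    """Re-assign edge weights for p = 1 (original) or p = -1 (inverse order).

    weights: {edge: weight}. The multiset of weight values is preserved.
    """
    if p == 1:
        return dict(weights)
    if p == -1:
        # Edges from the largest to the smallest original weight ...
        ranked = sorted(weights, key=weights.get, reverse=True)
        # ... receive the weight values from the smallest to the largest.
        values = sorted(weights.values())
        return dict(zip(ranked, values))
    raise ValueError("p must be 1 or -1")

def unweighted(weights):
    """Non-weighted version of the network: every edge has weight 1."""
    return {edge: 1.0 for edge in weights}
```

With p = -1 the edge that originally had the largest weight receives the smallest
one, so strong collaborations become weak ones while the topology is unchanged.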

We assessed the influence of weight on the communities using the function Q and
the dissimilarity D. The influence of weight on the communities detected by the GN
algorithm is shown in Figures 4 and 5. Although the numbers of communities in the
best divisions under different weights are nearly the same, the members of each
community are quite different. The same happens when hierarchical clustering is
used to analyze the network. Comparing these figures, we found that weight has a
bigger influence in the GN algorithm.

Figure 5: A: The dissimilarity between the non-weighted and weighted networks in the
GN algorithm. The dissimilarity of the best divisions is $D_{10,13} = 0.42$. B: The
dissimilarity between the weighted and inverse-weighted networks in the GN
algorithm. The dissimilarity of the best divisions is $D_{10,12} = 0.25$.



4 Concluding Remarks 



In this paper, we studied the community structure of a scientific collaboration
network using hierarchical clustering and the GN algorithm, paying particular
attention to the influence of weight on the detected communities. We found that the
GN algorithm gives better results: scientists who are in the same university or
institute, or who are interested in similar research topics, are clustered into one
community. To study the topological role of weight, we introduced a measure of the
difference between two divisions into communities. We then investigated the
clustering results for the non-weighted, weighted, and inverse-weighted networks.
Weight does influence the formation of communities, but the effect is not very
significant for our network of econophysicists. We conjecture that this is because
our network is sparse, so that the existence or absence of edges has a bigger
influence on community structure than the weights do.